Method and apparatus for identifier retrieval

ABSTRACT

A method for identifier retrieval. The method can include the steps of: extracting candidate identifiers from a data source according to a source identifier; obtaining a profile of the source identifier and profiles of the candidate identifiers from the data source; and selecting a target identifier associated with the source identifier from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers. The method may efficiently, accurately and rapidly find a target identifier associated with a source identifier.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of and claims priority from U.S. application Ser. No. 13/471,515 filed May 15, 2012, which in turn claims priority under 35 U.S.C. 119 from Chinese Application 201110145948.2, filed May 18, 2011, the entire contents of both applications are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

Embodiments of the present invention relate to the field of information retrieval, and more specifically, to a method and apparatus for identifier retrieval.

2. Description of the Related Art

In the current era of competition, it is important to obtain effective competitive information in various aspects, such as business, and increasingly more companies consider and synthesize competitive information when composing a business strategy. Traditionally, people have manually collected the desired competitive information via marketing surveys.

With the increasing development of society and information technology, the Internet provides more and more information to people, and at the same time, people transfer more and more information to the Internet. Much information is organized in text, such as news, introductory articles, reviews, etc. A considerable amount of content of the textual information is associated with categories of named entities, such as products, persons, organizations, etc. For example, many introductory articles and commentary articles on Internet hardware or software websites contain a large quantity of product information.

However, it is quite time-consuming and also impractical to manually obtain competitive information of companies from the Internet that contains mass data.

For example, when a user wants to know which companies are competitors of company A or which products are in a competitive relation with a given product of company A, he/she may use a source identifier to represent a product to be queried, and may retrieve a target identifier representing a competitive product by means of some reviews or introductory information on the Internet. At this point, if mass data on the Internet are browsed manually, it is impossible to accomplish such retrieval efficiently, accurately and rapidly.

BRIEF SUMMARY OF THE INVENTION

In order to overcome these deficiencies, the present invention provides a computer-implemented method for identifier retrieval, including: extracting candidate identifiers from a data source according to a source identifier; obtaining a profile of the source identifier and profiles of the candidate identifiers from the data source; and selecting a target identifier associated with the source identifier from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers.

According to another embodiment, the present invention provides an apparatus for identifier retrieval, including: extracting means configured to extract candidate identifiers from a data source according to a source identifier; obtaining means configured to obtain a profile of the source identifier and profiles of the candidate identifiers from the data source; and selecting means configured to select a target identifier associated with the source identifier from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

As the present invention is apprehended more thoroughly, other objects and effects of the present invention will become more apparent and easier to understand by means of the following description with reference to the accompanying drawings, wherein:

FIG. 1 is a flowchart of a method for identifier retrieval according to one embodiment of the present invention;

FIG. 2A is a flowchart of a method for identifier retrieval according to another embodiment of the present invention;

FIG. 2B is a continuation of the flowchart in FIG. 2A;

FIG. 3A is an example that can be used as a profile according to an embodiment of the present invention;

FIG. 3B is an example that cannot be used as a profile according to an embodiment of the present invention;

FIG. 4 is a block diagram of an apparatus for identifier retrieval according to one embodiment of the present invention; and

FIG. 5 is a structural block diagram of a computer system in which embodiments of the present invention can be implemented.

Like numerals represent the same, similar or corresponding features or functions throughout the figures.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

More detailed description will be presented below to embodiments of the present invention by referring to the figures. It is to be understood that the figures and embodiments of the present invention are merely for illustration, rather than to limit the scope of protection of the present invention.

The flowcharts and block diagrams in the figures illustrate the system, methods, as well as architecture, functions and operations executable by a computer program product according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code, which contains one or more executable instructions for performing specified logic functions. It should be noted that in some alternative implementations, functions indicated in blocks may occur in an order differing from the order as shown in the figures. For example, two blocks shown consecutively can be performed in parallel substantially or in an inverse order sometimes, which depends on the functions involved. It should be further noted that each block and a combination of blocks in the block diagrams or flowcharts can be implemented by a dedicated, hardware-based system for performing specified functions or operations or by a combination of dedicated hardware and computer instructions.

Technical terms used in embodiments of the present invention are first explained for the purpose of clarity.

1. Data Source

A data source can be user generated content (UGC), such as commentary information, news, a microblog, a blog, a bulletin board system (BBS) and other content on the Web with respect to a certain product or company, or any other content that can be browsed or viewed by users via a communication network.

In addition, a data source can be an ontology. An ontology can be used to capture knowledge in a related domain, provide common understanding of knowledge in the domain, determine vocabulary or concepts commonly recognized in the domain, and provide explicit definition of mutual relationships among these concepts from formalized patterns at different levels. Semantically speaking, relations between concepts can include: “part-of,” which represents a relation between part and entirety of concepts; “kind-of,” which represents an inheritance relation between concepts; “instance-of,” which represents a relation between an instance of a concept and the concept; and “attribute-of,” which represents that a certain concept is an attribute of another concept. In practical applications, relations between concepts are not limited to the above-enumerated four relations; rather, corresponding relations can be defined according to specific conditions of a domain. Ontologies that are currently in common use include, for example, WordNet, FrameNet, GUM, SENSUS, Mikrokosmos, etc. Among them, WordNet, an English lexicon based on psycholinguistic principles, organizes information in units of synsets (sets of synonyms that are interchangeable in a specific context). FrameNet, an English lexicon, provides relatively strong semantic analysis capabilities by using a description frame referred to as Frame Semantics and has since been developed into FrameNet II. GUM, which is oriented toward natural language processing, supports multilingual processing and includes basic concepts and conceptual organization forms independent of any specific language. SENSUS, also oriented toward natural language processing, provides conceptual mechanisms for machine translation and includes more than 70,000 concepts. Mikrokosmos, also oriented toward natural language processing, supports multilingual processing and represents knowledge by using TMR, an intermediate language among languages.

In addition, a data source can be a pre-established product knowledge base, including products' brand names, product models, companies owning them, product categories, and other product attribute information, etc.

2. Named Entity

A named entity (hereinafter referred to as an “entity” for short) is an important language unit carrying information in text and plays a significant role in various domains such as information abstraction, machine translation, automatic abstracting, etc. Named entity recognition (NER) mainly refers to recognizing named denotative items of entity concepts in data sources. Categories of named entities mainly include “persons,” “locations,” “organizations,” “time,” “quantity,” “products,” etc.

3. Identifier

An identifier may represent an entity by using, for example, the entity's full name, abbreviated name, English abbreviation and the like. An identifier can be inputted by a user directly, obtained from a data source according to an inputted object, or determined according to named entity recognition.

4. Object

An object can be an entity corresponding to an identifier. For example, when an identifier represents a product, an object may represent a company to which the product belongs, which can be the company's full name, abbreviated name, English abbreviation and the like.

An identifier may correspond to an object. In the present invention, one identifier may correspond to one or more objects, while one object may also correspond to one or more identifiers. Specifically, one product may belong to one or more companies or be a cooperative result of two companies, i.e., the product may belong to two companies. Meanwhile, one company may have one or more products, thereby having one or more products corresponding thereto.
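As a non-limiting illustration, the many-to-many correspondence between identifiers and objects can be held in two lookup tables. The following Python sketch assumes a hypothetical list of (identifier, object) pairs; the names PAIRS and build_indexes, the jointly owned product “ProductX,” and the companies “CompanyA” and “CompanyB” are assumptions made for illustration and are not part of the described embodiments.

```python
from collections import defaultdict

# Hypothetical (identifier, object) pairs, for illustration only.
PAIRS = [
    ("DB2", "IBM"),
    ("SQLServer", "Microsoft"),
    ("Windows", "Microsoft"),
    ("ProductX", "CompanyA"),  # hypothetical product owned jointly
    ("ProductX", "CompanyB"),  # by two hypothetical companies
]


def build_indexes(pairs):
    """Return (identifier -> objects, object -> identifiers) lookup tables."""
    objects_of = defaultdict(set)
    identifiers_of = defaultdict(set)
    for identifier, obj in pairs:
        objects_of[identifier].add(obj)
        identifiers_of[obj].add(identifier)
    return objects_of, identifiers_of


objects_of, identifiers_of = build_indexes(PAIRS)
print(sorted(objects_of["ProductX"]))       # objects of one identifier
print(sorted(identifiers_of["Microsoft"]))  # identifiers of one object
```

Keeping both directions indexed allows a lookup from a product to its companies as well as from a company to its products without scanning the pair list.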

In one embodiment of the present invention, a computer-implemented method for identifier retrieval is presented. In this embodiment, candidate identifiers are extracted from a data source according to a source identifier; a profile of the source identifier and profiles of the candidate identifiers are obtained from the data source; and finally, an identifier associated with the source identifier is selected from the candidate identifiers as a target identifier according to the obtained profile of the source identifier and profiles of the candidate identifiers.

FIG. 1 illustrates a flowchart of a method for identifier retrieval according to one embodiment of the present invention.

In step S101, candidate identifiers are extracted from a data source according to a source identifier.

In this step, named entity recognition can be first performed on the data source, and then identifiers that belong to the same entity category as the source identifier can be extracted as candidate identifiers from the recognized named entities.

In step S102, a profile of the source identifier and profiles of the candidate identifiers are obtained from the data source.

It is possible to search the data source for information related to the source identifier so as to be used as a profile of the source identifier. For example, it is possible to search the profile of the source identifier for descriptive information on the source identifier, and to update the profile of the source identifier with the descriptive information on the source identifier.

Also it is possible to search the data source for information related to the candidate identifiers so as to be used as profiles of the candidate identifiers. For example, it is possible to search the profiles of the candidate identifiers for descriptive information on the candidate identifiers, and to update the profiles of the candidate identifiers with the descriptive information on the candidate identifiers.

In step S103, a target identifier associated with the source identifier is selected from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers.

An identifier associated with the source identifier can be selected as a target identifier from the candidate identifiers by calculating a similarity between the source identifier and each of the candidate identifiers and then comparing the similarity with a predetermined threshold. The predetermined threshold can be obtained according to experience, or preset or obtained by those skilled in the art in any other proper manner.

The similarity between the source identifier and a candidate identifier can be calculated by various approaches. For example, keyword(s) (hereinafter referred to as “source keyword(s)”) can be extracted from the profile of the source identifier, then keywords (hereinafter referred to as “candidate keyword(s)”) can be extracted from the profile of a candidate identifier, and finally, the similarity is calculated according to the source keyword(s) and the candidate keyword(s). For another example, the profile of the source identifier can be directly compared with the profile of the candidate identifier by using, for example, a comparison approach for two sentences or a comparison approach for two paragraphs to calculate the similarity between the source identifier and the candidate identifier according to the profile of the source identifier and the profile of the candidate identifier.

In another embodiment of the present invention, a temporal order between the source identifier and the candidate identifiers can be determined based on the profile of the source identifier and the profiles of the candidate identifiers; a target identifier associated with the source identifier can be selected from candidate identifiers, when the temporal order meets a predetermined requirement.

Then, the flow of FIG. 1 ends.

In one embodiment of the present invention, before step S101, a source object input by a user can be received, and an identifier corresponding to the source object is looked up in the data source and subsequently used as the source identifier in steps S101 to S103.

In one embodiment of the present invention, after step S103, a source object corresponding to the source identifier and a target object corresponding to the target identifier can be determined, and the determined source object is associated with the determined target object.

FIGS. 2A and 2B illustrate a flowchart of a method for identifier retrieval according to another embodiment of the present invention.

In step S201, named entities are recognized from a data source.

Typically, named entity recognition refers to recognizing named denotative items of entity concepts in a data source. As described above, categories of named entities mainly include “persons,” “locations,” “organizations,” “time,” “quantity,” “products,” etc. Thus, entities of categories such as persons, locations, organizations, time, quantity, products, etc. can be obtained after performing named entity recognition on the data source.

In step S202, an identifier belonging to the same entity category as the source identifier is extracted as a candidate identifier from the recognized named entities.

In this step, it is possible to first judge an entity category to which the source identifier belongs, and then according to the entity category, determine a candidate identifier from the entities recognized in step S201.

In one embodiment of the present invention, suppose the source identifier is “DB2,” which represents a product of International Business Machines (IBM®) Corporation. In step S202, first it can be judged that the source identifier “DB2” represents an entity in the category of “products”; then, an entity belonging to the product category can be looked up in the entities recognized in step S201 and used as a candidate identifier. In this embodiment, suppose the candidate identifiers include three entities in the category of “products,” namely “SQL Server®,” “Windows®,” and “iPhone®.”
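By way of illustration only, steps S201 and S202 can be sketched in Python as a category filter over recognizer output. The sketch assumes that named entity recognition has already produced (name, category) pairs; the list RECOGNIZED and the function extract_candidates are hypothetical names, and no particular recognizer is implied.

```python
# Assumed output of step S201: (name, category) pairs from some recognizer.
RECOGNIZED = [
    ("DB2", "products"),
    ("SQLServer", "products"),
    ("Windows", "products"),
    ("iPhone", "products"),
    ("Bill Gates", "persons"),
    ("IBM", "organizations"),
]


def extract_candidates(source_identifier, recognized):
    """Step S202 sketch: keep entities in the source identifier's category."""
    categories = dict(recognized)
    source_category = categories.get(source_identifier)
    if source_category is None:
        return []  # the source identifier was not recognized at all
    candidates = []
    for name, category in recognized:
        is_new = name != source_identifier and name not in candidates
        if category == source_category and is_new:
            candidates.append(name)
    return candidates


print(extract_candidates("DB2", RECOGNIZED))
```

With the assumed input, the source identifier “DB2” falls in the “products” category, so only the other product entities are returned as candidate identifiers.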

It should be noted that in the present invention, the source identifier is not limited to entities in the product category, but can also represent entities in other categories such as persons, locations, organizations, time, quantity, etc.

For example, in another embodiment of the present invention, suppose the source identifier is “Jobs,” at which point the source identifier represents the leader of Apple Inc. In step S202, first it can be judged that the source identifier “Jobs” is an entity in the “persons” category; then, an entity belonging to the “persons” category can be looked up in the entities recognized in step S201 and used as a candidate identifier. In this embodiment, suppose the candidate identifiers include three entities in the “persons” category, namely “Zhang San,” “Bill Gates,” and “Obama.”

In step S203, information related to the source identifier is searched for in the data source to be used as a profile of the source identifier.

In embodiments of the present invention, information related to the source identifier “DB2” can be sentences, fragments, paragraphs, articles, or other types of content, which contain relations of comparison, enumeration, parallel, competition and so on. For example, it can be determined from the expression “Such as DB2, A, B and C” that DB2 is in a parallel or enumeration relation with A, B and C, so content containing the expression “Such as DB2, A, B and C” can be determined as information related to the source identifier “DB2” and further used as a profile of the source identifier “DB2.” Besides, it can be determined from both of the expressions “DB2 vs. A” and “Which one is better, DB2 or A?” that DB2 is in a comparison or competition relation with A, so content containing “DB2 vs. A” or “Which one is better, DB2 or A?” may also be determined as information related to the source identifier “DB2” and further used as its profile.
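The relation expressions mentioned above can be approximated, purely as an example, with regular expressions. The following Python sketch assumes the data source has already been split into text fragments; the functions relation_patterns and build_profile and the sample fragments are hypothetical, and the three patterns cover only the quoted expressions rather than every comparison, enumeration, parallel or competition relation.

```python
import re


def relation_patterns(identifier):
    """Build patterns for the example expressions around one identifier."""
    name = re.escape(identifier)
    return [
        # Comparison or competition: "DB2 vs. A" or "A VS DB2".
        re.compile(rf"\b{name}\s+vs\.?\s+\S+|\S+\s+vs\.?\s+{name}\b",
                   re.IGNORECASE),
        # Comparison: "Which one is better, DB2 or A?".
        re.compile(rf"\bwhich one is better,\s*{name}\s+or\s+\S+",
                   re.IGNORECASE),
        # Enumeration or parallel: "such as DB2, A, B and C".
        re.compile(rf"\bsuch as\s+{name}\s*,", re.IGNORECASE),
    ]


def build_profile(identifier, fragments):
    """Step S203 sketch: keep fragments that contain a relation expression."""
    patterns = relation_patterns(identifier)
    return [f for f in fragments if any(p.search(f) for p in patterns)]


# Hypothetical fragments standing in for content of the data source.
FRAGMENTS = [
    "DB2 VS PostgreSQL: a feature comparison",
    "Which one is better, DB2 or A?",
    "Databases such as DB2, A, B and C are widely deployed.",
    "Sun Microsystems announced a server that can also run DB2.",
]

profile = build_profile("DB2", FRAGMENTS)
print(profile)
```

In this sketch the last fragment mentions “DB2” but contains none of the relation expressions, so it is left out of the profile.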

FIG. 3A illustrates an example that can be used as a profile. In this example, “DB2 VS PostgreSQL” is contained, which represents that DB2 is in a comparison or competition relation with PostgreSQL, so this fragment can be used as a profile of the identifier “DB2.” On the other hand, if “PostgreSQL” is also regarded as an identifier, then the fragment illustrated in FIG. 3A can be used as a profile of the identifier “PostgreSQL.”

FIG. 3B illustrates an example that cannot be used as a profile. In this example, “DB2” and “Sun Microsystems®” are not in a parallel or enumeration relation; rather, they have little relevance. Hence, this fragment cannot be used as a profile of “DB2” or “Sun Microsystems®.”

In one embodiment of the present invention, the source identifier's profile obtained in step S203 can be optimized such that the optimized profile is more helpful to accurately determine a target identifier associated with the source identifier. For example, it is possible to look up descriptive information on the source identifier in the profile of the source identifier and update the profile of the source identifier with the descriptive information, so that the profile of the source identifier is optimized.

There are a number of implementing approaches to look up descriptive information in the profile of the source identifier. In one example, a focused named entity recognition or other filtering approach can be first performed on the profile to remove from the profile content that has little relevance with the source identifier, whereby a subset S1 of the profile is obtained; then, the subset S1 is used as descriptive information to replace the current profile of the source identifier. In another example, a focused named entity recognition or other filtering approach can be first performed on the profile to remove from the profile content that has little relevance with the source identifier, whereby a subset S1 is obtained; next, a subset S2, i.e., introductory or descriptive content regarding the source identifier, can be detected from the subset S1 by using a classification algorithm such as Naive Bayes, support vector machine, KNN, etc.; finally, the subset S2 is used as descriptive information to replace the current profile of the source identifier.
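As a simplified example of obtaining the subset S1, the following Python sketch keeps only the sentences of a profile that mention the identifier. A plain substring test is used here as a stand-in for the focused named entity recognition or other filtering approach, which is an assumption of this sketch; the function name filter_profile and the sample text are hypothetical, and the classifier-based detection of the subset S2 is not shown.

```python
import re


def filter_profile(identifier, profile_text):
    """Return subset S1: sentences of the profile mentioning the identifier."""
    # Naive sentence split on whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", profile_text.strip())
    needle = identifier.lower()
    return [s for s in sentences if needle in s.lower()]


# Hypothetical profile text, for illustration only.
TEXT = (
    "DB2 is a relational database from IBM. "
    "The weather was sunny. "
    "Many reviews compare DB2 with other databases."
)

s1 = filter_profile("DB2", TEXT)
print(s1)
```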

In step S204, information related to the candidate identifiers is searched for in the data source to be used as profiles of the candidate identifiers.

Like the source identifier's profile in step S203, information related to a candidate identifier can be sentences, fragments, paragraphs, articles, or other types of content, which contain relations of comparison, enumeration, parallel, competition and so on.

In the foregoing embodiment, supposing the candidate identifiers include three entities in the product category, namely “SQLServer®,” “Windows®,” and “iPhone®,” then in step S204, respective information associated with the three candidate identifiers is searched for in the data source and used as profiles of the three candidate identifiers respectively.

In one embodiment of the present invention, the candidate identifier's profile obtained in step S204 can be optimized such that the optimized profile is more helpful to accurately determine a target identifier associated with the source identifier. For example, it is possible to look up descriptive information on the candidate identifier in the profile of the candidate identifier and update the profile of the candidate identifier with the descriptive information, so that the profile of the candidate identifier is optimized.

There are a number of implementing approaches to look up descriptive information in the profile of the candidate identifier. In one example, first, a focused named entity recognition or other filtering approach can be performed on the profile to remove from the profile content that has little relevance with the candidate identifier, whereby a subset S1 of the profile is obtained; then, the subset S1 is used as descriptive information to replace the current profile of the candidate identifier. In another example, first, a focused named entity recognition or other filtering approach can be performed on the profile to remove from the profile content that has little relevance with the candidate identifier, whereby a subset S1 is obtained; next, a subset S2, i.e., introductory or descriptive content regarding the candidate identifier, can be detected from the subset S1 by using a classification algorithm such as Naive Bayes, support vector machine, KNN, etc.; finally, the subset S2 is used as descriptive information to replace the current profile of the candidate identifier.

In step S205, source keyword(s) is/are extracted from the profile of the source identifier.

Various keyword extracting approaches that are known in the art can be used to perform step S205. Known keyword extracting algorithms include frequency or rule-based keyword extraction, such as a statistics-based approach and a rule-based approach. Among them, the statistics-based approach can be easily implemented without a complex training process, for example, an approach based on word co-occurrence; and the rule-based approach trains discrete eigenvalues of phrases by using, for example, the Naive Bayes technique to obtain weights of a model. Known keyword extracting algorithms further include keyword extraction based on semantic part-of-speech features, which can extract keywords with a relatively high accuracy rate, for example, an approach based on natural language understanding, referring to “Zhang Yingying et al., Chinese Keyword Extracting Algorithm Based on Synonyms Chain, Computer Engineering, 2010, 36(19): 93-95,” “Zhang Hong, Keyword Extracting Algorithm Based on Automatic Text Classification, 2009, 35(12): 145-147,” “Medelyan O, Witten I H. Thesaurus Based Automatic Keyphrase Indexing[C]//Proc. of the Joint Conference on Digital Libraries. Chapel Hill, N.C., USA: [s. n.], 2006: 296-297,” or “Ercan G, Cicekli I. Using Lexical Chains for Keyword Extraction[J]. Information Processing and Management, 2007, 43(6): 1705-1714,” etc.

In one embodiment of the present invention, when the source identifier represents an entity in the product category, the source keyword can be, for example, one or more keywords in the profile of the source identifier that are used for describing information such as product model, series, technical parameter, occurrence frequency, etc.

In another embodiment of the present invention, when the source identifier represents an entity in the “persons” category, the source keyword can be, for example, one or more keywords in the profile of the source identifier that are used for describing information such as position, diploma, profession, service period, occurrence frequency, etc.

In step S206, candidate keyword(s) is/are extracted from the profile of the candidate identifier.

This step is implemented in a similar way to step S205. The difference is that the candidate keyword is one or more keywords in the profile of the candidate identifier, i.e., it is extracted from a different profile than the source keyword.
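Steps S205 and S206 can be illustrated with a minimal frequency-based extractor applied to either profile. The following Python sketch is only one possible statistics-based stand-in and is not the approach of any of the cited references; the stop-word list, the function name extract_keywords and the sample profile text are assumptions made for illustration.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "for", "with", "in", "on",
             "to", "it"}


def extract_keywords(profile_text, top_n=5):
    """Return the top_n most frequent non-stop-word tokens of a profile."""
    tokens = re.findall(r"[a-z0-9]+", profile_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Hypothetical profile text, for illustration only.
PROFILE = ("DB2 is a relational database. The database supports SQL "
           "and the database scales.")

keywords = extract_keywords(PROFILE, top_n=2)
print(keywords)
```

The same function can be called once on the profile of the source identifier and once on the profile of each candidate identifier, yielding the source keyword(s) and the candidate keyword(s) respectively.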

In step S207, the similarity between the source identifier and the candidate identifier is calculated according to the source keyword(s) and the candidate keyword(s).

The similarity between the source identifier and the candidate identifier can be obtained by various similarity calculating approaches. In one embodiment of the present invention, a vector of the source keywords obtained in step S205 can be constructed, which is referred to as a source vector; likewise, a vector of the candidate keywords obtained in step S206 can be constructed, which is referred to as a candidate vector. According to the obtained source vector and candidate vector, the similarity between them can be calculated as the cosine of the angle therebetween.
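The cosine calculation can be sketched as follows, assuming that each keyword list is turned into a term-frequency vector; this vector construction is an assumption of the sketch, since the weighting of vector components is not specified above. The function name cosine_similarity and the sample keyword lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(source_keywords, candidate_keywords):
    """Cosine of the angle between two keyword term-frequency vectors."""
    source_vec = Counter(source_keywords)
    candidate_vec = Counter(candidate_keywords)
    # Dot product; a Counter returns 0 for keywords it does not contain.
    dot = sum(source_vec[term] * candidate_vec[term] for term in source_vec)
    norm_source = math.sqrt(sum(c * c for c in source_vec.values()))
    norm_candidate = math.sqrt(sum(c * c for c in candidate_vec.values()))
    if norm_source == 0 or norm_candidate == 0:
        return 0.0  # an empty keyword list is treated as having no similarity
    return dot / (norm_source * norm_candidate)


# Hypothetical keyword lists, for illustration only.
score = cosine_similarity(["database", "sql", "server"],
                          ["database", "sql", "index"])
print(round(score, 3))
```

The result lies between 0 (no shared keywords) and 1 (identical keyword distributions), which makes it directly comparable with a predetermined threshold.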

Further, the similarity between the source identifier and the candidate identifier can be calculated by using a similarity calculating method such as the Dice coefficient, Chi-square, log likelihood ratio, F1 measure, and the like.

In step S208, it is judged whether the similarity calculated in step S207 is greater than a predetermined threshold or not. If yes, the flow proceeds to step S209; if not, the flow ends.

The predetermined threshold used for comparison with the similarity as calculated in step S207 can be obtained in various manners. For example, the predetermined threshold can be obtained according to experience, or can be preset or obtained by those skilled in the art in any other proper manner.

In the embodiment described with reference to step S202, suppose the source identifier is the product “DB2” of IBM® Corporation, and the candidate identifiers recognized in step S202 are “SQLServer®,” “Windows®,” and “iPhone®.” Suppose it is calculated in step S207 that the similarity between the source identifier “DB2” and the first candidate identifier “Windows®” is 0.2, the similarity between the source identifier “DB2” and the second candidate identifier “iPhone®” is 0.1, and the similarity between the source identifier “DB2” and the third candidate identifier “SQLServer®” is 0.8. In addition, suppose a predetermined threshold is 0.6. Then, it can be judged in step S208 that the similarity between the third candidate identifier “SQLServer®” and the source identifier “DB2” is greater than the predetermined threshold.

In step S209, this candidate identifier is selected as a target identifier associated with the source identifier.

At this point, it can be determined that the target identifier associated with the source identifier is the third candidate identifier “SQLServer®.”
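The comparison in steps S208 and S209 can be sketched in Python with the similarity values and the threshold supposed in this example. The function name select_targets and the dictionary layout are assumptions made for illustration.

```python
def select_targets(similarities, threshold):
    """Steps S208 and S209 sketch: keep candidates above the threshold."""
    return [name for name, score in similarities.items() if score > threshold]


# Similarity values and threshold as supposed in the example above.
SIMILARITIES = {"Windows": 0.2, "iPhone": 0.1, "SQLServer": 0.8}
THRESHOLD = 0.6

targets = select_targets(SIMILARITIES, THRESHOLD)
print(targets)
```

Because the comparison is strict, a candidate whose similarity merely equals the threshold is not selected, which matches the “greater than” test of step S208.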

In the present invention, two identifiers being “associated with” each other may represent that these two identifiers have a competition relation, a comparison relation, or any other proper predefined relation. Through the foregoing steps, it is possible to realize the procedure of looking up a target identifier from a source identifier. In practical application, the product “SQLServer®” in a competition relation with the product DB2 can be found through this procedure of lookup.

In another embodiment of the present invention, suppose the source identifier is “Jobs,” an entity in the “persons” category; and suppose the candidate identifiers include three entities in the “persons” category, namely “Zhang San,” “Bill Gates,” and “Obama.” After the processing in steps S203 to S209, it can be determined that “Bill Gates” is the target identifier according to the fact that the similarity between “Bill Gates” and “Jobs” is greater than the predetermined threshold. In this way, the retrieval of the associated target identifier from the source identifier is realized.

In step S210, a source object corresponding to the source identifier is determined.

In one embodiment of the present invention, the source identifier is “DB2.” Since it is a product of International Business Machines (IBM®) Corporation, it can be determined that a source object corresponding to the source identifier “DB2” is “International Business Machines Corporation.” It should be noted that the source object can be an abbreviated name, an abbreviation, or a general name of International Business Machines Corporation, or any name that is capable of identifying the company and frequently used by users, such as “IBM,” etc.

In step S211, a target object corresponding to the target identifier is determined.

Like step S210, this step may determine a company to which a product represented by the target identifier belongs, according to the product. For example, for the target identifier “SQLServer®,” it can be determined that a target object corresponding to it is “Microsoft Corporation.” It should be noted that the target object can be “Microsoft Corporation,” or an abbreviated name, an abbreviation, a general name of Microsoft Corporation, or any name that is capable of identifying the company and frequently used by users, such as “Microsoft®,” or “MS.”

In step S212, the source object is associated with the target object.

At this point, it can be determined that the target object associated with the source object (e.g., “IBM®”) is “Microsoft®.”

In the present invention, two objects being “associated with” each other may represent that these two objects have a competition relation, a comparison relation, or any other proper predefined relation. Through the foregoing steps, it is possible to realize the procedure of looking up a target object from a source object. In practical applications, by means of finding out that the product SQLServer® is in a competition relation with the product DB2, it can be determined that Microsoft® is in a competition relation with IBM®.

In an example of the present invention, when associating the source object with the target object, an exemplary result can be outputted as below:

-   “IBM vs Microsoft (DB2 vs SQLServer)”
-   “IBM vs Oracle (DB2 vs Oracle)”
-   . . .

The foregoing result indicates that IBM® and Microsoft® have an association (e.g., competition) relation due to their respective products DB2 and SQLServer®; also IBM® and Oracle® have an association (e.g., competition) relation due to their respective products DB2 and Oracle®.
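Steps S210 to S212 and the exemplary output can be sketched as a lookup from identifiers to objects followed by string formatting. In the following Python sketch, the mapping OWNER stands in for a product knowledge base and is an assumption, as is the function name associate_objects.

```python
# Hypothetical identifier -> object mapping standing in for a knowledge base.
OWNER = {"DB2": "IBM", "SQLServer": "Microsoft", "Oracle": "Oracle"}


def associate_objects(source_identifier, target_identifiers, owner):
    """Steps S210 to S212 sketch: pair the source object with target objects."""
    source_object = owner[source_identifier]            # step S210
    results = []
    for target_identifier in target_identifiers:
        target_object = owner.get(target_identifier)    # step S211
        if target_object is None:
            continue  # no known object for this target identifier
        results.append(                                  # step S212
            f"{source_object} vs {target_object} "
            f"({source_identifier} vs {target_identifier})"
        )
    return results


lines = associate_objects("DB2", ["SQLServer", "Oracle"], OWNER)
for line in lines:
    print(line)
```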

Then, the flow of FIGS. 2A and 2B ends.

It should be noted that steps S210 to S212 are not indispensable butoptional. The target identifier associated with the source identifier isalready capable of being determined in step S209. Steps S210 to S212expand this procedure, thereby realizing determination of the targetobject associated with the source object according to the associationbetween the source identifier and the target identifier.

In one embodiment of the present invention, before step S201, a source object input by a user can be received (for example, a user inputs “IBM”), subsequently an identifier (e.g., “DB2”) corresponding to the source object can be looked up in the data source, and the identifier can be used as the source identifier used in steps S201 to S212. It should be noted that the source identifier is not limited to only coming from a source object input by a user; it can be directly inputted by the user or obtained in any other proper manner those skilled in the art may contemplate.
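A minimal sketch of this pre-processing follows, assuming the data source can be viewed as a list of records that link an object to an identifier. The record layout, the name `DATA_SOURCE`, and the function `resolve_source_identifiers` are hypothetical and serve only to illustrate the two input paths (a source object, or a source identifier entered directly).

```python
# Hypothetical data source: records linking an object to one of its identifiers.
DATA_SOURCE = [
    {"object": "IBM", "identifier": "DB2"},
    {"object": "Microsoft", "identifier": "SQLServer"},
    {"object": "Oracle", "identifier": "Oracle"},
]


def resolve_source_identifiers(user_input, data_source=DATA_SOURCE):
    """Return the source identifier(s) to use for a given user input.

    Records for the object are looked up first; if none match and the
    input is itself a known identifier, it is used directly.
    """
    by_object = [r["identifier"] for r in data_source if r["object"] == user_input]
    if by_object:
        return by_object
    known_identifiers = {r["identifier"] for r in data_source}
    return [user_input] if user_input in known_identifiers else []


print(resolve_source_identifiers("IBM"))
print(resolve_source_identifiers("DB2"))
```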

In another embodiment of the present invention, the procedure of selecting a target identifier associated with the source identifier from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers can be further implemented in the following manner: determining a temporal order between the source identifier and the candidate identifiers based on the profile of the source identifier and the profiles of the candidate identifiers; and selecting a target identifier associated with the source identifier from candidate identifiers when the temporal order meets a predetermined requirement.

In one specific implementation, temporal information related to the source identifier can be recognized in the profile of the source identifier, temporal information related to the candidate identifiers can be recognized in the profiles of the candidate identifiers, and a temporal order between the source identifier and each of the candidate identifiers is determined by comparing the temporal information; afterwards, candidate identifiers that do not meet a predetermined requirement can be removed or filtered. For example, it can be determined whether the source identifier “DB2” was released before or after the candidate identifier “SQLServer®”. When a predetermined requirement is that the source identifier should be released before the candidate identifier, a candidate identifier released before the source identifier “DB2” is removed. Then, a candidate identifier released after the source identifier “DB2” can be determined as a target identifier associated with the source identifier.

In another specific implementation, temporal information related to the source identifier and temporal information related to the candidate identifiers can be recognized from the profile of the source identifier and the profiles of the candidate identifiers, respectively. Then, a temporal order between the source identifier and each of the candidate identifiers can be determined by comparing the temporal information; next, a candidate identifier that does not meet a predetermined requirement can be removed or filtered according to the requirement; subsequently, a target identifier can be selected from the remaining candidate identifiers according to steps S205 to S209.
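The temporal filtering described in the two implementations above may be sketched as follows. The sketch assumes that a release date has already been recognized in each profile (or is `None` where no temporal information was found); the candidate names and dates are placeholders, not real release dates, and the function name is hypothetical.

```python
from datetime import date


def filter_by_temporal_order(source_date, candidate_dates, require="after"):
    """Keep candidates whose date satisfies the required order.

    require="after": keep candidates released after the source identifier.
    require="before": keep candidates released before the source identifier.
    Candidates without recognized temporal information are dropped.
    """
    kept = []
    for name, released in candidate_dates.items():
        if released is None:
            continue
        if require == "after" and released > source_date:
            kept.append(name)
        elif require == "before" and released < source_date:
            kept.append(name)
    return kept


# Placeholder dates for illustration only.
source_released = date(1995, 1, 1)
candidates = {
    "CandidateA": date(2000, 6, 1),
    "CandidateB": date(1990, 3, 1),
    "CandidateC": None,
}
remaining = filter_by_temporal_order(source_released, candidates, require="after")
print(remaining)
```

In the first implementation the surviving candidates are taken directly as target identifiers; in the second they would be passed on to the similarity-based selection of steps S205 to S209.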

In another embodiment of the present invention, when there are a relatively large number of source identifiers and/or target identifiers, association relations between source identifiers and target identifiers can be built in the form of a graph, which is referred to as an “identifier association graph” for short. A vertex in the identifier association graph may correspond to a source identifier or a target identifier. An edge between two vertexes may correspond to an association relation between a source identifier and a target identifier, and the edge can be directional (e.g., shown by an arrow) so as to represent a temporal order between the two vertexes. For example, an arrow pointing from the first vertex to the second vertex represents that the second vertex appears or occurs at a time after the first vertex. In addition, the identifier association graph may also be represented in the form of text (e.g., TXT, XML, or another typical text markup format). Furthermore, those skilled in the art would readily appreciate that an association relation between identifiers can be represented in various proper forms, without limitation to the graph or text file that merely serves as an example here.

The identifier association graph can be constructed in the background. According to the identifier association graph, the associated target identifier can be directly determined from the source identifier, thereby improving the real-time processing speed and increasing the processing efficiency.
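One possible in-memory form of such an identifier association graph is a directed adjacency structure, sketched below. The class name `AssociationGraph`, its methods, and the plain-text edge format are assumptions made for illustration; an edge from one vertex to another is taken to mean that the second appears later in time, as in the arrow convention described above.

```python
from collections import defaultdict


class AssociationGraph:
    """Directed graph: an edge u -> v means v is associated with u and appears later."""

    def __init__(self):
        self._successors = defaultdict(set)

    def add_association(self, earlier, later):
        self._successors[earlier].add(later)

    def targets_of(self, source):
        """Look up the associated targets of a source directly, without re-running retrieval."""
        return sorted(self._successors.get(source, ()))

    def to_text(self):
        """Serialize the edges as plain text, one 'u -> v' line per edge."""
        return "\n".join(
            f"{u} -> {v}"
            for u in sorted(self._successors)
            for v in sorted(self._successors[u])
        )


graph = AssociationGraph()
graph.add_association("DB2", "SQLServer")
graph.add_association("DB2", "Oracle")
print(graph.targets_of("DB2"))
print(graph.to_text())
```

Because the graph is built ahead of time, a query reduces to a dictionary lookup, which is the source of the real-time speed-up mentioned above.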

In another embodiment of the present invention, when there are a relatively large number of source objects and/or target objects, association relations between source objects and target objects can be built in the form of a graph, which is referred to as an “object association graph” for short. Like an identifier association graph, a vertex in the object association graph may correspond to a source object or a target object. An edge between two vertexes may correspond to an association relation between a source object and a target object, and the edge can be directional (e.g., shown by an arrow) so as to represent a precedence sequence between the two vertexes. It should be noted that an association relation between objects can be represented in various proper forms, without limitation to the graph or text file that merely serves as an example here.

The object association graph can be constructed in the background. According to the object association graph, the associated target object can be directly determined from the source object, thereby improving the real-time processing speed and increasing the processing efficiency.

FIG. 4 is a block diagram of an apparatus 400 for identifier retrieval according to one embodiment of the present invention. The apparatus 400 for identifier retrieval may include: extracting means 410, obtaining means 420, and selecting means 430. The extracting means 410 can be configured to extract candidate identifiers from a data source according to a source identifier. The obtaining means 420 can be configured to obtain a profile of the source identifier and profiles of the candidate identifiers from the data source. The selecting means 430 can be configured to select a target identifier associated with the source identifier from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers.
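The cooperation of the three means may be sketched in software as follows. This is only an illustrative composition: the class `IdentifierRetrievalApparatus`, the toy data source (a mapping from identifier to descriptive text), and the three stand-in functions are assumptions and do not reflect any particular implementation of means 410, 420, and 430.

```python
from typing import Callable, Dict, List

Extract = Callable[[str, Dict[str, str]], List[str]]
Obtain = Callable[[str, Dict[str, str]], str]
Select = Callable[[str, Dict[str, str]], List[str]]


class IdentifierRetrievalApparatus:
    """Wires together stand-ins for means 410, 420 and 430."""

    def __init__(self, extract: Extract, obtain: Obtain, select: Select):
        self.extract = extract  # plays the role of extracting means 410
        self.obtain = obtain    # plays the role of obtaining means 420
        self.select = select    # plays the role of selecting means 430

    def retrieve(self, source_identifier: str, data_source: Dict[str, str]) -> List[str]:
        candidates = self.extract(source_identifier, data_source)
        source_profile = self.obtain(source_identifier, data_source)
        candidate_profiles = {c: self.obtain(c, data_source) for c in candidates}
        return self.select(source_profile, candidate_profiles)


# Toy stand-ins operating on a hypothetical data source.
def extract(source, data_source):
    return [name for name in data_source if name != source]


def obtain(identifier, data_source):
    return data_source[identifier]


def select(source_profile, candidate_profiles):
    source_words = set(source_profile.split())
    return [
        name
        for name, profile in candidate_profiles.items()
        if source_words & set(profile.split())  # shares at least one word
    ]


docs = {
    "DB2": "relational database server",
    "SQLServer": "relational database engine",
    "ImageTool": "picture editor",
}
apparatus = IdentifierRetrievalApparatus(extract, obtain, select)
targets = apparatus.retrieve("DB2", docs)
print(targets)
```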

In one embodiment of the present invention, the extracting means 410 can include: named entity recognizing means configured to recognize named entities from the data source; and candidate identifier extracting means configured to extract, from the recognized named entities, identifiers belonging to the same entity category as the source identifier, as candidate identifiers.
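The two sub-means may be illustrated with the following sketch, in which a small dictionary (gazetteer) lookup stands in for named entity recognition. This substitution is an assumption made to keep the example self-contained; a practical recognizer would typically be a trained model. The names `GAZETTEER`, `recognize_named_entities`, and `extract_candidates` are hypothetical.

```python
import re

# Hypothetical gazetteer: entity name -> entity category.
GAZETTEER = {
    "DB2": "product",
    "SQLServer": "product",
    "IBM": "organization",
    "Microsoft": "organization",
}


def recognize_named_entities(text, gazetteer=GAZETTEER):
    """Return (name, category) pairs in order of first appearance."""
    found = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        category = gazetteer.get(token)
        if category is not None and (token, category) not in found:
            found.append((token, category))
    return found


def extract_candidates(text, source_identifier, gazetteer=GAZETTEER):
    """Keep recognized entities in the same category as the source identifier."""
    source_category = gazetteer.get(source_identifier)
    return [
        name
        for name, category in recognize_named_entities(text, gazetteer)
        if category == source_category and name != source_identifier
    ]


text = "IBM ships DB2, while Microsoft ships SQLServer."
print(extract_candidates(text, "DB2"))
```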

In one embodiment of the present invention, the obtaining means 420 can include: source identifier profile searching means configured to search the data source for information related to the source identifier so as to be used as a profile of the source identifier; and candidate identifier profile searching means configured to search the data source for information related to the candidate identifiers so as to be used as profiles of the candidate identifiers.

In one implementation, the source identifier profile searching means can further include: source identifier descriptive information looking up means configured to look up descriptive information on the source identifier in the profile of the source identifier; and source identifier profile updating means configured to update the profile of the source identifier with the descriptive information on the source identifier.

In one implementation, the candidate identifier profile searching means can further include: candidate identifier descriptive information looking up means configured to look up descriptive information on the candidate identifiers in the profiles of the candidate identifiers; and candidate identifier profile updating means configured to update the profiles of the candidate identifiers with the descriptive information on the candidate identifiers.

In one embodiment of the present invention, the selecting means 430 can include: a calculating unit configured to calculate a similarity between the source identifier and one of the candidate identifiers; and a selecting unit configured to select the one of the candidate identifiers as a target identifier associated with the source identifier when the similarity is greater than a predetermined threshold.

In one implementation, the calculating unit can include: source keyword extracting means configured to extract a source keyword from the profile of the source identifier; candidate keyword extracting means configured to extract a candidate keyword from the profile of one of the candidate identifiers; and similarity calculating means configured to calculate the similarity between the source identifier and the one of the candidate identifiers according to the source keyword and the candidate keyword.
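A minimal sketch of the calculating unit and the selecting unit follows. The passage above does not fix a particular similarity measure, so Jaccard similarity over keyword sets is used here purely as an assumed example; the stop-word list, the threshold value of 0.3, and all names are likewise illustrative.

```python
# Assumed stop-word list used to reduce a profile to keywords.
STOPWORDS = {"a", "an", "the", "is", "of", "for", "and", "by"}


def extract_keywords(profile):
    """Reduce a profile text to a set of lower-cased keywords."""
    return {word for word in profile.lower().split() if word not in STOPWORDS}


def jaccard_similarity(keywords_a, keywords_b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    if not keywords_a and not keywords_b:
        return 0.0
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def select_targets(source_profile, candidate_profiles, threshold=0.3):
    """Select candidates whose similarity to the source exceeds the threshold."""
    source_keywords = extract_keywords(source_profile)
    selected = []
    for name, profile in candidate_profiles.items():
        similarity = jaccard_similarity(source_keywords, extract_keywords(profile))
        if similarity > threshold:
            selected.append(name)
    return selected


source_profile = "a relational database management system"
candidate_profiles = {
    "CandidateA": "relational database management system for enterprises",
    "CandidateB": "photo editing tool",
}
print(select_targets(source_profile, candidate_profiles))
```

Any other keyword-based measure (for example, a weighted cosine similarity) could be substituted for `jaccard_similarity` without changing the selection logic.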

In one embodiment of the present invention, the selecting means 430 can include: temporal order determining means configured to determine a temporal order between the source identifier and each of the candidate identifiers based on the profile of the source identifier and the profiles of the candidate identifiers; and target identifier selecting means configured to select a target identifier associated with the source identifier from the candidate identifiers when the temporal order meets a predetermined requirement.

In one embodiment of the present invention, the apparatus 400 for identifier retrieval can further include: receiving means (not shown), which can be configured to receive a source object input by a user; and looking up means (not shown), which can be configured to look up in the data source an identifier corresponding to the source object to be used as the source identifier.

In one embodiment of the present invention, the apparatus 400 for identifier retrieval can further include: determining means (not shown), which can be configured to determine a source object corresponding to the source identifier and a target object corresponding to the target identifier; and associating means (not shown), which can be configured to associate the source object with the target object.

FIG. 5 schematically illustrates a structural block diagram of a computing apparatus in which embodiments according to the present invention can be implemented.

A computer system as illustrated in FIG. 5 includes a CPU (central processing unit) 501, RAM (random access memory) 502, ROM (read only memory) 503, a system bus 504, a hard disk controller 505, a keyboard controller 506, a serial interface controller 507, a parallel interface controller 508, a display controller 509, a hard disk 510, a keyboard 511, a serial peripheral device 512, a parallel peripheral device 513 and a display 514. Among these components, the CPU 501, the RAM 502, the ROM 503, the hard disk controller 505, the keyboard controller 506, the serial interface controller 507, the parallel interface controller 508, and the display controller 509 are connected to the system bus 504; the hard disk 510 is connected to the hard disk controller 505; the keyboard 511 is connected to the keyboard controller 506; the serial peripheral device 512 is connected to the serial interface controller 507; the parallel peripheral device 513 is connected to the parallel interface controller 508; and the display 514 is connected to the display controller 509.

The function of each component in FIG. 5 is publicly known in this technical field, and the structure as shown in FIG. 5 is conventional. In different applications, some components can be added to the structure shown in FIG. 5, or some components shown in FIG. 5 can be omitted. The whole system shown in FIG. 5 is controlled by computer readable instructions usually stored in the hard disk 510 as software, or stored in EPROM or other nonvolatile memories. The software can be downloaded from the network (not shown in the figure). The software stored in the hard disk 510 or downloaded from the network can be loaded into RAM 502 and executed by the CPU 501 to perform functions determined by the software.

Although the computer system as described in FIG. 5 can support the identifier retrieval apparatus according to embodiments of the present invention, it is merely one example of a computer system. Those skilled in the art would readily appreciate that many other computer system designs can also realize embodiments of the present invention. The present invention further relates to a computer program product, which includes non-transient program code for: extracting candidate identifiers from a data source according to a source identifier; obtaining a profile of the source identifier and profiles of the candidate identifiers from the data source; and selecting a target identifier associated with the source identifier from the candidate identifiers according to the profile of the source identifier and the profiles of the candidate identifiers. Before use, the code can be stored in a memory of a computer system, for example, stored in a hard disk or a removable memory such as a CD or a floppy disk, or downloaded via the Internet or other computer networks.

The methods as disclosed in the present embodiments can be implemented in software, hardware or a combination of software and hardware. The hardware portion can be implemented by using dedicated logic; the software portion can be stored in a memory and executed by an appropriate instruction executing system such as a microprocessor, a personal computer (PC) or a mainframe computer. In an embodiment, the present invention is implemented as software, including, without limitation, firmware, resident software, micro-code, etc.

Moreover, the present invention can be implemented as a computer program product used by computers or accessible by computer-readable media that provide non-transient program code for use by or in connection with a computer or any instruction executing system. For the purpose of description, a computer-usable or computer-readable medium can be any tangible means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.

The medium can be an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system (apparatus or device), or propagation medium. Examples of the computer-readable medium would include the following: a semiconductor or solid storage device, a magnetic tape, a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), a hard disk, and an optical disk. Examples of the current optical disk include a compact disk read-only memory (CD-ROM), compact disk-read/write (CD-R/W), and DVD.

A system adapted for storing and/or executing program code according to an embodiment of the present invention would include at least one processor that is coupled to a memory element directly or via a system bus. The memory element may include a local memory usable during actual execution of the non-transient program code, a mass memory, and a cache that provides temporary storage for at least one portion of the non-transient program code so as to decrease the number of times code is retrieved from the mass memory during execution.

An Input/Output or I/O device (including, without limitation, a keyboard, a display, a pointing device, etc.) can be coupled to the system directly or via an intermediate I/O controller.

A network adapter may also be coupled to the system such that the data processing system can be coupled to other data processing systems, remote printers or storage devices via an intermediate private or public network. A modem, a cable modem, and an Ethernet card are merely examples of currently available network adapters.

The communication network mentioned in the specification may include various types of networks, including, without limitation, a local area network (“LAN”), a wide area network (“WAN”), a network according to IP Protocol (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer network).

It should be noted that some more specific technical details that are publicly known to those skilled in the art and that might be essential to the implementation of the present invention are omitted in the above description in order to make the present invention more easily understood.

The specification of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art.

Therefore, the embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand that all modifications and alterations made without departing from the spirit of the present invention fall into the protection scope of the present invention as defined in the appended claims.

1. A computer-implemented method for identifier retrieval, comprising: extracting candidate identifiers from a data source according to a source identifier; obtaining a profile of said source identifier and profiles of said candidate identifiers from said data source; and selecting a target identifier associated with said source identifier from said candidate identifiers according to said profile of said source identifier and said profiles of said candidate identifiers.
2. The method according to claim 1, wherein extracting candidate identifiers comprises: recognizing named entities from said data source; and extracting as candidate identifiers, from said recognized named entities, identifiers belonging to the same entity category as said source identifier.
3. The method according to claim 1, wherein obtaining a profile of said source identifier and profiles of said candidate identifiers comprises: searching said data source for information related to said source identifier so as to be used as a profile of said source identifier; and searching said data source for information related to said candidate identifiers so as to be used as profiles of said candidate identifiers.
4. The method according to claim 3, wherein searching said data source for information related to said source identifier further comprises: looking up descriptive information on said source identifier in said profile of said source identifier; and updating said profile of said source identifier with said descriptive information on said source identifier.
5. The method according to claim 3, wherein searching said data source for information related to said candidate identifiers further comprises: looking up descriptive information on said candidate identifiers in said profiles of said candidate identifiers; and updating said profiles of said candidate identifiers with said descriptive information on said candidate identifiers.
6. The method according to claim 1, wherein selecting a target identifier associated with said source identifier from said candidate identifiers comprises: calculating a similarity between said source identifier and one of said candidate identifiers; and provided that said similarity is greater than a predetermined threshold, selecting said one of said candidate identifiers as said target identifier associated with said source identifier.
7. The method according to claim 6, wherein calculating a similarity between said source identifier and one of said candidate identifiers comprises: extracting a source keyword from said profile of said source identifier; extracting a candidate keyword from said profile of one of said candidate identifiers; and calculating said similarity between said source identifier and said one of said candidate identifiers according to said source keyword and said candidate keyword.
8. The method according to claim 1, wherein selecting a target identifier associated with said source identifier from said candidate identifiers further comprises: determining a temporal order between said source identifier and said candidate identifiers based on said profile of said source identifier and said profiles of said candidate identifiers; and provided that said temporal order meets a predetermined requirement, selecting a target identifier associated with said source identifier from said candidate identifiers.
9. The method according to claim 1, prior to extracting candidate identifiers from a data source, further comprising: receiving a source object input by a user; and looking up in said data source an identifier corresponding to said source object to be used as said source identifier.
10. The method according to claim 1, further comprising: determining a source object corresponding to said source identifier; determining a target object corresponding to said target identifier; and associating said source object with said target object.